Abstract: Sparse representations with learned dictionaries
have been successful in several image analysis applications. In
this paper, we propose and analyze the framework of ensemble
sparse models, and demonstrate their utility in image restoration
and unsupervised clustering. The proposed ensemble model
approximates the data as a linear combination of approximations
from multiple weak sparse models. Theoretical analysis of the
ensemble model reveals that, even in the worst case, the ensemble
performs at least as well as the best of its constituent individual
models. The dictionaries corresponding to the individual sparse
models are obtained using either random example selection or
boosted approaches. Boosted approaches learn one dictionary per
round such that the dictionary learned in a particular round is
optimized for the training examples having high reconstruction
error in the previous round. Results with compressed recovery
show that the ensemble representations lead to better performance
compared to using a single dictionary obtained with the
conventional alternating minimization approach. The proposed
ensemble models are also used for single image superresolution,
and we show that they perform comparably to recent
approaches. In unsupervised clustering, experiments show that
the proposed model performs better than baseline approaches on
several standard datasets.

Index Terms — Sparse coding, dictionary learning, ensemble 
models, image recovery, clustering. 

I. INTRODUCTION 

Natural signals and images reveal statistics that allow them
to be efficiently represented using a sparse linear combination
of elementary patterns [1]. The local regions of natural images,
referred to as patches, can be represented using a sparse linear
combination of columns from a dictionary matrix. Given a
data sample $\mathbf{x} \in \mathbb{R}^M$ and a dictionary matrix $\mathbf{D} \in \mathbb{R}^{M \times K}$,
the data is approximated using the linear generative model as
$\mathbf{D}\mathbf{a}$, where $\mathbf{a} \in \mathbb{R}^K$ is the sparse coefficient vector. This
generative model, which incorporates sparsity constraints in the
coefficient vector, will be referred to as the sparse model in
this paper. The dictionary can be either pre-defined or learned
from the training examples themselves. Learning the dictionary
will be alternatively referred to as learning the sparse
model. Learned dictionaries have been shown to provide improved
performance for restoring degraded data in applications
such as denoising, inpainting, deblurring, superresolution, and
compressive sensing [2], [3], and also in machine learning
applications such as classification and clustering [4]–[6].

A. Sparse Coding and Dictionary Learning 

Using the linear generative model, the sparse code of a data 
sample x can be obtained by optimizing, 

$$h(\mathbf{x}, \mathbf{D}) = \min_{\mathbf{a}} \|\mathbf{x} - \mathbf{D}\mathbf{a}\|_2^2 + \lambda \|\mathbf{a}\|_1. \tag{1}$$

Here $\|\mathbf{a}\|_1$ is the $\ell_1$ penalty that promotes sparsity of the
coefficients, and the equivalence of (1) to $\ell_0$ minimization has
been discussed in [7] under some strong conditions on the dictionary
$\mathbf{D}$. Methods used to obtain sparse representations
include Matching Pursuit (MP) [8], Orthogonal Matching
Pursuit (OMP) [9], Order-Recursive Matching Pursuit [10],
Basis Pursuit (BP) [11], FOCUSS [12], and iterated shrinkage
algorithms [13], [14].
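As an illustration of the iterated shrinkage family mentioned above, the following is a minimal NumPy sketch of ISTA applied to the objective in (1). The function names, the iteration count, and the synthetic data are assumptions made for this example, not part of the original method description.

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista_sparse_code(x, D, lam=0.2, n_iter=500):
    """Approximately minimize ||x - D a||_2^2 + lam * ||a||_1 over a."""
    # Step size 1 / Lf, where Lf = 2 * ||D||_2^2 is the Lipschitz constant
    # of the gradient of the quadratic term.
    Lf = 2.0 * np.linalg.norm(D, 2) ** 2
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * D.T @ (D @ a - x)
        a = soft_threshold(a - grad / Lf, lam / Lf)
    return a

def objective(x, D, a, lam=0.2):
    """Value of the cost in (1) for a given coefficient vector a."""
    return np.sum((x - D @ a) ** 2) + lam * np.sum(np.abs(a))

# Synthetic example (hypothetical data): a 5-sparse signal in a random dictionary.
rng = np.random.default_rng(0)
D = rng.standard_normal((64, 256))
D /= np.linalg.norm(D, axis=0)            # unit-norm atoms
a_true = np.zeros(256)
a_true[rng.choice(256, 5, replace=False)] = rng.standard_normal(5)
x = D @ a_true
a_hat = ista_sparse_code(x, D)
```

With the fixed step size $1/L_f$ the objective value does not increase from one iteration to the next, so the sketch is safe to stop early when only a rough code is needed.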

In several image processing and machine learning
applications, it is advantageous to learn a dictionary such that
the set of training examples obtained from a probability space
have a small approximation error with sparse coding. This
problem can be expressed as minimizing the objective [15]



$$g(\mathbf{D}) = \mathbb{E}_{\mathbf{x}}\left[h(\mathbf{x}, \mathbf{D})\right], \tag{2}$$



where the columns of $\mathbf{D}$, referred to as dictionary atoms, are
constrained to have at most unit $\ell_2$ norm, i.e., $\|\mathbf{d}_j\|_2 \le 1, \forall j$. If the
distribution in the probability space is unknown and we only
have $T$ training examples $\{\mathbf{x}_i\}_{i=1}^T$, each with probability mass
$p(\mathbf{x}_i)$, (2) can be modified as the empirical cost function,



$$g(\mathbf{D}) = \sum_{i=1}^{T} h(\mathbf{x}_i, \mathbf{D})\, p(\mathbf{x}_i). \tag{3}$$



Typically, dictionary learning algorithms solve for the sparse
codes [16], [17] using (1), and obtain the dictionary by
minimizing $g(\mathbf{D})$, repeating the steps until convergence. We
refer to this baseline algorithm as Alt-Opt. Since this is an
alternating minimization process, it is important to provide
a good initial dictionary, and this is performed by setting the
atoms to normalized cluster centers of the data [18]. Instead of
learning dictionaries using sophisticated learning algorithms,
it is possible to use the training examples themselves as the
dictionary. Since the number of examples $T$ is usually much
larger than the number of dictionary atoms $K$, it is much
more computationally intensive to obtain sparse representations
with examples. Nevertheless, both learned and example-based
dictionaries have found applications in inverse problems
[2], [19], [20] and also in machine learning applications such
as clustering and classification [4], [5], [21]–[29].

B. Ensemble Sparse Models 

In this paper, we propose and explore the framework of 
ensemble sparse models, where we assume that data can 
be represented using a linear combination of L different 
sparse approximations, instead of being represented using 
an approximation obtained from a single sparse model. The 
approximation to x can be obtained by optimizing 



$$\min_{\{\beta_l\}_{l=1}^{L}} \left\| \mathbf{x} - \sum_{l=1}^{L} \beta_l \mathbf{D}_l \mathbf{a}_l \right\|_2^2. \tag{4}$$

Here each coefficient vector $\mathbf{a}_l$ is assumed to be sparse, and
is obtained by solving the optimization (1) with $\mathbf{D}_l$ as
the dictionary. The weights $\{\beta_l\}_{l=1}^L$ control the contribution
of each base model to the ensemble.

Since the ensemble combines the contributions of multiple
models, it is sufficient that the dictionary for each model is obtained
using a "weak" training procedure. We propose to learn
these weak dictionaries $\{\mathbf{D}_l\}_{l=1}^L$ sequentially, using a greedy
forward selection procedure, such that training examples that
incurred a high approximation error with the dictionary $\mathbf{D}_l$
are given more importance while learning $\mathbf{D}_{l+1}$. Furthermore,
we also propose an ensemble model where each individual
dictionary is designed as a random subset of training samples.
The formulations described in this paper belong to the category
of boosting [30] and random selection algorithms [31] in
machine learning. In supervised learning, boosting is used to
improve the accuracy of learning algorithms, using multiple
weak hypotheses instead of a single strong hypothesis. The
proposed ensemble sparse models are geared towards two
image analysis problems: the inverse problem of restoring
degraded images, and the problem of unsupervised clustering.
Note that boosted ensemble models have been used with
the bag-of-words approach for updating codebooks in classification
[32] and medical image retrieval [33]. However, they
have not been used so far in sparsity-based image restoration
problems or unsupervised clustering. Also, in contrast to
[34], where the authors propose to obtain multiple randomized
sparse representations from a single dictionary, our
approach learns an ensemble of dictionaries
and obtains a single representation from each of them. Typical
ensemble methods for regression [35] modify the samples
in each round of leveraging, whereas in our case the same
training set is used for each round.

C. Contributions 

In this work, we propose the framework of ensemble sparse
models and perform a theoretical analysis that relates their
performance to that of their constituent base sparse
models. We show that, even in the worst case, an ensemble
will perform at least as well as its best constituent sparse
model. Experimental demonstrations that support this theory
are also provided. We propose two approaches for learning
the ensemble: (a) using a random selection and averaging
(RandExAv) approach, where each dictionary is chosen as
a random subset of the training examples, and (b) using a
boosted approach to learn dictionaries sequentially by modifying
the probability masses of the training examples in each
round of learning. In the boosted approach, we provide two methods to
learn the weak dictionaries for the individual sparse models:
one that performs example selection using the probability
distribution on the training set (BoostEx), and another that
uses a weighted K-means approach (BoostKM).
For all cases of ensemble learning, we also provide methods to
obtain the ensemble weights, $\{\beta_l\}_{l=1}^L$, from the training examples.
Demonstrations that show the convergence of ensemble
learning with the increase in the number of constituent sparse
models are provided. Experiments also show that the proposed
ensemble approaches perform better than their best constituent
sparse models, as predicted by theory.

In order to demonstrate the effectiveness of the proposed
ensemble models, we explore their application to image recovery
and clustering. The image recovery problems that we consider
here are compressive sensing using random projections and
single image superresolution. When boosted ensemble models
are learned for image recovery problems, the form of the degradation
operator specific to the application is also considered,
thereby optimizing the ensemble for the application. For
compressive recovery, we compare the performance of the
proposed Random Example Averaging (RandExAv), Boosted
Example (BoostEx), and Boosted K-Means (BoostKM) approaches
to the single sparse model, whose dictionary is
obtained using the Alt-Opt approach. It is shown that the
ensemble methods perform consistently better than a single
sparse model for different numbers of measurements. Note
that the base sparse model for example-based approaches is
designed as a random subset of examples, and hence it requires
minimal training. Furthermore, in image superresolution, the
performance of the proposed ensemble learning approaches is
comparable to the recent sparse representation methods [36],
[20].

Furthermore, we explore the use of the proposed approaches
in unsupervised clustering. When the data are clustered along
unions of subspaces, an $\ell_1$ graph [29] can be obtained by
representing each data sample as a sparse linear combination
of the rest of the samples in the set. Another approach
proposed in [5] computes the sparse coding-based graph using
codes obtained with a learned dictionary. We propose to use
ensemble methods to compute sparse codes for each data
sample, and perform spectral clustering using graphs obtained
from them. Results with several standard datasets show that
high clustering performance is obtained using the proposed
approach when compared to $\ell_1$ graph-based clustering.

II. ANALYSIS OF ENSEMBLE MODELS

We will begin by motivating the need for an ensemble model 
in place of a single sparse model, and then proceed to derive 
some theoretical guarantees on the ensemble model. Some 
demonstrations on the performance of ensemble models will 
also be provided. 

A. Need for the Ensemble Model 

In several scenarios, a single sparse model may be insufficient
for representing the data, and using an ensemble
model instead may result in a good performance. The need
for ensemble models in supervised learning has been well
studied [37]. We will argue that the same set of reasons applies
to the case of ensemble sparse models also. The first reason is
statistical, whereby several sparse models may have a similar
training error when learned from a limited number of training
samples. However, the performance of each of these models
with test data can be poor. By averaging representations
obtained from an ensemble, we may obtain an approximation
closer to the true test data. The second reason is computational,
which can occur with the case of large training sets also.
The inherent issue in this case is that sparse modeling is a
problem with locally optimal solutions. Therefore, we may
never be able to reach the globally optimal solution with a
single model, and hence using an ensemble model may result
in a lower representation error. Note that this case is quite
common in dictionary learning, since many dictionary learning
algorithms only seek a locally optimal solution. The third reason
for using an ensemble model is representational, wherein
the hypothesis space assumed cannot represent the test data
sample. In the case of sparse models, this corresponds to the
case where the dictionary cannot provide a high-fidelity sparse
approximation for a novel test data sample. This also happens
in the case where the test observation is a corrupted version of
the underlying test data, and there is ambiguity in obtaining
a sparse representation. In this case also, it may be necessary
to combine multiple sparse models to improve the estimate of
the test data.

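In order to simplify notation in the following analysis, let
us denote the $l$th approximation in the ensemble model as $\mathbf{c}_l = \mathbf{D}_l\mathbf{a}_l$.
The individual approximations are stacked in the matrix
$\mathbf{C} \in \mathbb{R}^{M \times L}$, where $\mathbf{C} = [\mathbf{c}_1 \ldots \mathbf{c}_L]$, and the weight vector
is denoted as $\boldsymbol{\beta} = [\beta_1 \ldots \beta_L]^T$. The individual residuals are
denoted as $\mathbf{r}_l = \mathbf{x} - \mathbf{c}_l$, for $l = 1, \ldots, L$, and the total residual
of the approximation is given as

$$\mathbf{r} = \mathbf{x} - \mathbf{C}\boldsymbol{\beta}. \tag{5}$$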

We characterize the behavior of the ensemble sparse model by
considering four different cases for the weights.

1) Unconstrained Weights: In this case, the ensemble
weights $\{\beta_l\}_{l=1}^L$ are assumed to be unconstrained and are
computed using the unconstrained least squares approximation

$$\min_{\boldsymbol{\beta}} \|\mathbf{x} - \mathbf{C}\boldsymbol{\beta}\|_2^2. \tag{6}$$

When the data $\mathbf{x}$ lies in the span of $\mathbf{C}$, the residual will be
zero, i.e., $\mathbf{r} = \mathbf{0}$. The residual that has minimum energy in the
$L$ approximations is denoted as $\mathbf{r}_{\min}$. This residual can be
obtained by setting the corresponding weight in the vector $\boldsymbol{\beta}$
to 1 and the rest to 0, whereas (6) computes the $\boldsymbol{\beta}$ that achieves the best possible
residual $\mathbf{r}$ for the total approximation. Clearly this implies

$$\|\mathbf{r}\|_2 \le \|\mathbf{r}_{\min}\|_2. \tag{7}$$

Therefore, at worst, the total approximation will be as good
as the best individual approximation.
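The bound in (7) can be checked numerically. The sketch below is only an illustration under assumed synthetic data: the individual approximations are simulated as noisy copies of the data sample rather than produced by actual sparse models.

```python
import numpy as np

rng = np.random.default_rng(0)
M, L = 64, 20
x = rng.standard_normal(M)
# Hypothetical individual approximations c_l: noisy copies of x.
C = x[:, None] + 0.5 * rng.standard_normal((M, L))

# Unconstrained least squares weights, as in (6).
beta, *_ = np.linalg.lstsq(C, x, rcond=None)
r = x - C @ beta                                   # total residual, as in (5)

# Energy of the best individual residual r_min.
r_min_norm = min(np.linalg.norm(x - C[:, l]) for l in range(L))
```

Here `np.linalg.norm(r)` cannot exceed `r_min_norm`, because selecting a single column of `C` is one of the weight vectors the least squares problem searches over.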

2) $\beta_l \ge 0$: The ensemble weights $\{\beta_l\}_{l=1}^L$ are assumed to
be non-negative in this case. The least squares approximation
(6), with the constraint $\boldsymbol{\beta} \ge \mathbf{0}$, will now result in a zero
residual if the data $\mathbf{x}$ lies in the simplicial cone generated
by the columns of $\mathbf{C}$. The simplicial cone is defined as the
set $\{\mathbf{b} : \mathbf{b} = \sum_{l=1}^{L} \beta_l \mathbf{c}_l,\ \beta_l \ge 0\}$. Otherwise, the bound on the total
residual given by (7) holds in this case, since $\mathbf{r}_{\min}$ can be
obtained by setting the appropriate weight in $\boldsymbol{\beta}$ to 1 in (6),
and the rest to 0, in this case also.

Fig. 1. Performance of the "oracle" ensemble models for various constraints
on weights and different dictionary sizes in the base models.

3) $\sum_{l=1}^{L} \beta_l = 1$: When the ensemble weights are constrained
to sum to 1, the total residual can be expressed as

$$\mathbf{r} = \sum_{l=1}^{L} \beta_l \mathbf{r}_l. \tag{8}$$

This can be easily obtained by replacing $\mathbf{x}$ as $\sum_{l=1}^{L} \beta_l \mathbf{x}$ in (5).
Denoting the residual matrix $\mathbf{R} = [\mathbf{r}_1 \ldots \mathbf{r}_L]$, the optimization
(4) to compute the weights can also be posed as $\min_{\boldsymbol{\beta}} \|\mathbf{R}\boldsymbol{\beta}\|_2$.
Incorporating the constraint $\sum_{l=1}^{L} \beta_l = 1$, it can be seen that
the final approximation $\mathbf{C}\boldsymbol{\beta}$ lies in the affine hull generated
by the columns of $\mathbf{C}$, and the final residual, $\mathbf{R}\boldsymbol{\beta}$, lies in the
affine hull generated by the columns of $\mathbf{R}$. Clearly the final
residual will be zero only if the data $\mathbf{x}$ lies in the affine hull
of $\mathbf{C}$, or equivalently the zero vector lies in the affine hull of
$\mathbf{R}$. When this does not hold, the worst case bound on $\mathbf{r}$ given
by (7) holds in this case as well.
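For the sum-to-one case, one way to compute the weights is to solve the KKT system of the equality-constrained least squares problem directly. The following sketch assumes synthetic approximations and a residual matrix with full column rank; the function name is hypothetical.

```python
import numpy as np

def sum_to_one_weights(C, x):
    """Weights minimizing ||x - C beta||_2 subject to sum(beta) = 1."""
    L = C.shape[1]
    R = x[:, None] - C                      # residual matrix [r_1 ... r_L]
    # KKT system of: min ||R beta||_2^2  subject to  1^T beta = 1.
    # Stationarity: 2 R^T R beta + nu * 1 = 0; feasibility: 1^T beta = 1.
    K = np.zeros((L + 1, L + 1))
    K[:L, :L] = 2.0 * R.T @ R
    K[:L, L] = 1.0
    K[L, :L] = 1.0
    rhs = np.zeros(L + 1)
    rhs[L] = 1.0
    sol = np.linalg.solve(K, rhs)
    return sol[:L], R

rng = np.random.default_rng(0)
M, L = 64, 20
x = rng.standard_normal(M)
C = x[:, None] + 0.5 * rng.standard_normal((M, L))   # hypothetical approximations

beta, R = sum_to_one_weights(C, x)
r_total = x - C @ beta
r_min_norm = min(np.linalg.norm(R[:, l]) for l in range(L))
```

Because the weights sum to one, `r_total` coincides with `R @ beta`, which is the identity in (8), and its norm stays below `r_min_norm` since each unit coordinate vector satisfies the constraint.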

4) $\beta_l \ge 0$, $\sum_{l=1}^{L} \beta_l = 1$: Similar to the previous case, the
total residual can be expressed as (8). As a result, the final
representation $\mathbf{C}\boldsymbol{\beta}$ lies in the convex hull generated by the
columns of $\mathbf{C}$, and the final residual, $\mathbf{R}\boldsymbol{\beta}$, lies in the convex
hull generated by the columns of $\mathbf{R}$. Furthermore, the final
residual will be zero only if the zero vector lies in the convex
hull of $\mathbf{R}$. Clearly, the worst case bound on $\mathbf{r}$ given by (7)
holds in this case.

Although the worst case bounds for all four cases are
the same, the constraint spaces for the cases give
us an idea about their relative performances with real data.
The first case is unconstrained and it should result in the least
error. The second case constrains the solution to lie
in the simplicial cone spanned by the columns of $\mathbf{C}$, and this
should lead to higher residual energy than Case 1. Case 3
constrains the solution to lie in an affine hull, which is of $L - 1$
dimensions compared to the simplicial cone in $L$ dimensions, so
it could lead to a higher error compared to Case 2. The constraint
space of Case 4 is a subset of the constraint spaces for Cases 1 to 3, and hence it
will lead to the highest residual error.

B. Demonstration of Ensemble Representations 

In order to demonstrate the performance of ensemble representations
with real data, we obtain a random set of 100,000
patches, each of size $8 \times 8$, from a set of natural images.
The training images were obtained from the superresolution
toolbox published by Yang et al. [38], and consist of a wide
variety of patterns and textures. We will refer to this set of
training images simply as the training image set throughout
this paper. The chosen patches are then processed to remove
the mean, followed by the removal of low-variance patches.
Since image recovery is the important application of the
proposed models, considering high-variance patches alone is
beneficial. Each dictionary in the ensemble, $\mathbf{D}_l$, is obtained as
a random set of $K$ vectorized and normalized patches. We fix
the number of models in the ensemble as $L = 20$. The test data
is a random set of 1000 grayscale patches obtained from the
Berkeley segmentation dataset [39]. For each test sample, we
compute the set of $L$ approximations using the sparse model
given in (1), with $\lambda = 0.2$. The individual approximations
are combined into an ensemble under the four conditions
on the weights, $\{\beta_l\}$, described above. The optimal weights
are computed and the mean squared norm of the residuals
for all the test samples is compared in Figure 1 for the
dictionary sizes $K = \{256, 1024, 2048\}$. We observe that the
performance of the ensembles generally improves as the size of
the dictionaries used in the base models increases. The variation
in performance across all the four cases of weights follows our
reasoning in the previous section. We refer to these as "oracle"
ensemble models, since the weights are optimally computed
with perfect knowledge of all the individual approximations
and the actual data. In reality, the weights will be precomputed
from the training data.

III. PROPOSED ENSEMBLE SPARSE REPRESENTATION ALGORITHMS

The ensemble model proposed in (4) results in a good
approximation for any known data. However, in order to use
ensemble models in analysis and recovery of images that are
possibly corrupted or degraded, both the weights $\{\beta_l\}_{l=1}^L$ and
the dictionaries $\{\mathbf{D}_l\}_{l=1}^L$ must be inferred from uncorrupted
training data. The set of weights is fixed to be common
for all test observations instead of computing a new set of
weights for each observation. Let us denote the set of training
samples as $\mathbf{X} = [\mathbf{x}_1\ \mathbf{x}_2 \ldots \mathbf{x}_T]$, and the set of coefficients
in base model $l$ as $\mathbf{A}_l = [\mathbf{a}_{l,1}\ \mathbf{a}_{l,2} \ldots \mathbf{a}_{l,T}]$, where $\mathbf{a}_{l,i}$ is
the coefficient vector of the $i$th sample for base model $l$. In
the proposed ensemble learning procedures, we consider both
simple averaging and boosting approaches.

A. Random Example Averaging Approach 

The first approach chooses $L$ random subsets of $K$ samples
from the training data itself and normalizes them to form
the dictionaries $\{\mathbf{D}_l\}_{l=1}^L$. The weights $\{\beta_l\}$ are chosen to be
equal for all base models as $1/L$. Note that the selection of
dictionaries follows the same procedure as given in the previous
demonstration (Section II-B). We refer to this ensemble
approach as Random Example Averaging (RandExAv).
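A minimal sketch of RandExAv follows. To keep it short, the sparse coder is replaced by a 1-sparse approximation (the single best-matching atom), which is an assumption of this example and stands in for solving (1); the data and all function names are likewise hypothetical.

```python
import numpy as np

def random_example_dictionaries(X, K, L, rng):
    """Draw L dictionaries, each made of K normalized training samples (columns of X)."""
    dictionaries = []
    for _ in range(L):
        idx = rng.choice(X.shape[1], size=K, replace=False)
        D = X[:, idx] / np.linalg.norm(X[:, idx], axis=0)
        dictionaries.append(D)
    return dictionaries

def one_sparse_approx(x, D):
    """Stand-in coder: best 1-sparse approximation of x with unit-norm atoms."""
    corr = D.T @ x
    k = np.argmax(np.abs(corr))
    return corr[k] * D[:, k]

def randexav_approx(x, dictionaries):
    """Equal-weight (beta_l = 1/L) average of the individual approximations."""
    return np.mean([one_sparse_approx(x, D) for D in dictionaries], axis=0)

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 500))        # hypothetical training samples
x = rng.standard_normal(16)               # hypothetical test sample
dictionaries = random_example_dictionaries(X, K=32, L=20, rng=rng)

errs = [np.linalg.norm(x - one_sparse_approx(x, D)) for D in dictionaries]
err_ens = np.linalg.norm(x - randexav_approx(x, dictionaries))
```

By the triangle inequality, `err_ens` is never larger than the mean of `errs`; how much smaller it gets depends on how uncorrelated the individual residuals are.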

B. Boosting Approaches 

The next two approaches use boosting and obtain the
dictionaries and weights sequentially, such that the training
examples that resulted in a poor performance with the $(l-1)$th
base model are given more importance when learning the $l$th
base model. We use a greedy forward selection procedure for
obtaining the dictionaries and the weights. In each round $l$, the
model is augmented with one dictionary $\mathbf{D}_l$, and the weight
$\alpha_l$ corresponding to the dictionary is obtained. The cumulative
representation for round $l$ is given by

$$\mathbf{X}_l = (1 - \alpha_l)\mathbf{X}_{l-1} + \alpha_l \mathbf{D}_l \mathbf{A}_l. \tag{9}$$

Note that the weights of the greedy forward selection algorithm,
$\alpha_l$, and the weights of the ensemble model, $\beta_l$, are
related as

$$\beta_l = \alpha_l \prod_{t=l+1}^{L} (1 - \alpha_t). \tag{10}$$

From (9), it can be seen that $\mathbf{X}_l$ lies in the affine hull of $\mathbf{X}_{l-1}$
and $\mathbf{D}_l\mathbf{A}_l$. Furthermore, from the relationship between the
weights $\{\alpha_l\}$ and $\{\beta_l\}$ given in (10), it is clear that $\sum_{l=1}^{L} \beta_l = 1$
and hence the ensemble model uses the constraints given in
Case 3. Only Cases 3 and 4 lead to an efficient greedy
forward selection approach for the ensemble model in (4),
and we use Case 3 since it leads to a better approximation
performance (Figure 1).
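The relation between the greedy weights and the ensemble weights can be verified with a few lines of NumPy. This sketch assumes that the recursion starts from a zero representation and that the first greedy weight equals 1, so that the first round simply adopts the first base model; random matrices stand in for the individual approximations.

```python
import numpy as np

def ensemble_weights(alpha):
    """beta_l = alpha_l * prod_{t > l} (1 - alpha_t), as in (10)."""
    alpha = np.asarray(alpha, dtype=float)
    beta = np.empty(len(alpha))
    tail = 1.0                              # running product over t = l+1, ..., L
    for l in range(len(alpha) - 1, -1, -1):
        beta[l] = alpha[l] * tail
        tail *= 1.0 - alpha[l]
    return beta

rng = np.random.default_rng(0)
L, M, T = 5, 8, 10
# Hypothetical stand-ins for the individual approximations D_l A_l.
approx = [rng.standard_normal((M, T)) for _ in range(L)]
# Assumption: alpha_1 = 1; the remaining greedy weights are arbitrary here.
alpha = np.concatenate(([1.0], rng.uniform(0.1, 0.9, L - 1)))

# Sequential update of the cumulative representation, as in (9).
X_cum = np.zeros((M, T))                    # assumption: zero starting point
for a_l, DA in zip(alpha, approx):
    X_cum = (1.0 - a_l) * X_cum + a_l * DA

# Equivalent one-shot ensemble combination with the weights beta_l.
beta = ensemble_weights(alpha)
X_ens = sum(b * DA for b, DA in zip(beta, approx))
```

Under these assumptions `X_cum` and `X_ens` agree and `beta` sums to one, matching the Case 3 constraint.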

In boosted ensemble learning, the importance of the training
samples in a particular round is controlled by modifying
their probability masses. Each round consists of (a) learning
a dictionary $\mathbf{D}_l$ corresponding to the round, (b) computing
the approximation for the current round $l$, (c) estimating
the weight $\alpha_l$, (d) computing the residual energy for the
training samples, and (e) updating the probability masses of
the training samples for the next round. Since the goal of
ensemble approaches is to have only weak individual models,
$\mathbf{D}_l$ is obtained using naive dictionary learning procedures as
described later in this section. The dictionaries for the first
round are obtained by fixing uniform probability masses for
each training example in the first round, i.e., $p_1(\mathbf{x}_i) = 1/T$
for $i = \{1, 2, \ldots, T\}$. Assuming that $\mathbf{D}_l$ is known, the
approximation for the current round is computed by coding
the training samples $\mathbf{X}$ with the dictionary using (1). The
weight $\alpha_l$ is computed such that the error between the training
samples and the cumulative approximation $\mathbf{X}_l$ is minimized.
Using (9), this optimization can be expressed as

$$\min_{\alpha_l} \left\| \mathbf{X} - \left[(1 - \alpha_l)\mathbf{X}_{l-1} + \alpha_l \mathbf{D}_l \mathbf{A}_l\right] \right\|_F^2, \tag{11}$$

and can be solved in closed form with the optimal value given
as

$$\alpha_l = \frac{\mathrm{Tr}\left[(\mathbf{X} - \mathbf{X}_{l-1})^T(\mathbf{D}_l\mathbf{A}_l - \mathbf{X}_{l-1})\right]}{\|\mathbf{D}_l\mathbf{A}_l - \mathbf{X}_{l-1}\|_F^2}, \tag{12}$$



where Tr denotes the trace of the matrix. The residual matrix
for all the training samples in round $l$ is given by $\mathbf{R}_l = \mathbf{X} - \mathbf{D}_l\mathbf{A}_l$.
The energy of the residual for the $i$th training sample is
given as $e_l(i) = \|\mathbf{r}_{l,i}\|_2^2$. If the dictionary in round $l$ provides a
large approximation error for sample $i$, then that sample will
be given more importance in round $l + 1$. This will ensure
that the residual error for sample $i$ in round $l + 1$ will be
small. The simple scheme of updating the probability masses as
$p_{l+1}(\mathbf{x}_i) = e_l(i)$ upweights the badly represented samples
and downweights the well-represented ones for the next round.
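The weight computation and the mass update of one boosting round can be sketched as follows. Normalizing the residual energies so that they sum to one is an assumption added here so that the masses form a valid distribution for sampling; the matrices are synthetic stand-ins and the function names are hypothetical.

```python
import numpy as np

def greedy_weight(X, X_prev, DA):
    """Closed-form weight for the cumulative update, using Tr[A^T B] = sum(A * B)."""
    diff = DA - X_prev
    return np.sum((X - X_prev) * diff) / np.sum(diff ** 2)

def update_masses(X, DA):
    """Residual energies of the current base model, normalized to sum to one."""
    e = np.sum((X - DA) ** 2, axis=0)       # e_l(i) = ||r_{l,i}||_2^2 per column
    return e / e.sum()

def cumulative_error(X, X_prev, DA, a):
    """Squared Frobenius error of the cumulative approximation for weight a."""
    return np.sum((X - ((1.0 - a) * X_prev + a * DA)) ** 2)

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 30))                       # hypothetical training data
X_prev = X + 0.5 * rng.standard_normal((8, 30))        # previous cumulative approximation
DA = X + 0.5 * rng.standard_normal((8, 30))            # current base approximation

alpha = greedy_weight(X, X_prev, DA)
p_next = update_masses(X, DA)
```

Since the error is quadratic in the weight, evaluating `cumulative_error` slightly to either side of `alpha` gives a quick sanity check that the closed form is the minimizer.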

Given a training set $\{\mathbf{x}_i\}_{i=1}^T$ and its probability masses
$\{p_l(\mathbf{x}_i)\}_{i=1}^T$, we will propose two simple approaches for
learning the dictionaries corresponding to the individual sparse
models.



IEEE TRANSACTIONS ON IMAGE PROCESSING 



[Figure 2. Axis label: Dictionary Size. Legend: BoostEx, BoostEx (min), BoostKM, BoostKM (min), RandExAv, RandExAv (min).]

Fig. 2. Comparison of the performance of the proposed ensemble learning approaches for various dictionary sizes.

[Figure 3. Axis label: Rounds.]

Fig. 3. Convergence characteristics of the proposed ensemble learning approaches.



1) BoostKM: When the sparse code for each training
example is constrained to have only one non-zero coefficient
of value 1, and the norms of the dictionary atoms
are unconstrained, the dictionary learning problem (3) can be
shown to reduce to K-Means clustering. Hence, computing a
set of K-Means cluster centers and normalizing them to unit
$\ell_2$ norm constitutes a reasonable weak dictionary. However,
since the distribution on the data could be non-uniform in our
case, we need to alter the clustering scheme to incorporate
this. Denoting the cluster centers to be $\{\boldsymbol{\mu}_k\}_{k=1}^K$ and the cluster
membership sets to be $\{\mathcal{M}_k\}_{k=1}^K$, the weighted K-Means
objective is denoted as

$$\min_{\{\boldsymbol{\mu}_k, \mathcal{M}_k\}_{k=1}^{K}} \sum_{k=1}^{K} \sum_{i \in \mathcal{M}_k} p(\mathbf{x}_i) \|\mathbf{x}_i - \boldsymbol{\mu}_k\|_2^2. \tag{13}$$



The weighted K-Means procedure is implemented by modifying
the scalable K-Means++ algorithm, also referred to
as the K-Means|| (K-Means Parallel) algorithm [40]. The
K-Means|| algorithm is an improvement over the K-Means++
algorithm [41] that provides a method for careful initialization
leading to improved speed and accuracy in clustering. The
advantage with the K-Means|| algorithm is that the initialization
procedure is scalable to a large number of samples.
In fact, it has been shown in [40] that just the initialization
procedure in K-Means|| results in a significant reduction in the
clustering cost. Since we are interested in learning only a weak
dictionary, we will use the normalized cluster centers obtained
after initialization as our dictionary. The K-Means++ algorithm
selects initial cluster centers sequentially such that they are
relatively spread out. For initializing $K$ cluster centers, the
algorithm creates a distribution on the data samples and picks
a cluster center by sampling it and appends it to the current
set of centers. The distribution is updated after each cluster
center is selected. In contrast, the K-Means|| algorithm updates
the distribution much more infrequently, after choosing $q$
cluster centers in each iteration. This process is repeated for $s$
iterations, and finally the number of cluster centers obtained
is $sq$. The chosen centers are re-clustered to obtain the initial
set of $K$ clusters. It is clear that $s$ must be chosen such that
$sq > K$. We provide only the initialization of the weighted
K-Means|| algorithm that also takes the data distribution, $\{p_l(\mathbf{x}_i)\}_{i=1}^T$,
into consideration.

Let us denote $\delta_i$ as the shortest distance of the $i$th training
sample to the set of cluster centers already chosen. The
initialization of the weighted K-Means|| algorithm proceeds
as follows:

(a) Initialize $\mathcal{M} = \{\}$.

(b) Pick the first center $\boldsymbol{\mu}_1$ from the training set based on the
distribution $\{p_l(\mathbf{x}_i)\}_{i=1}^T$, and append it to $\mathcal{M}$.

(c) The set of intermediate cluster centers, $\mathcal{M}'$, is created
using $q$ samples from the data, $\{\mathbf{x}_i\}_{i=1}^T$, according to the
probability $\frac{q\, p_l(\mathbf{x}_i)\,\delta_i}{\sum_{j=1}^{T} p_l(\mathbf{x}_j)\,\delta_j}$.

(d) Augment the set $\mathcal{M} \leftarrow \mathcal{M} \cup \mathcal{M}'$.

(e) Repeat steps (b), (c) and (d) for $s$ iterations.

(f) Set the weight of each element $\boldsymbol{\mu}$ in the set $\mathcal{M}$ as the
sum of weights of samples in $\mathbf{X}$ that are closer to $\boldsymbol{\mu}$ than
any other sample in $\mathcal{M}$.

(g) Perform weighted clustering on the elements of $\mathcal{M}$ to
obtain the set of $K$ cluster centers, $\mathcal{M}$.

Note that the steps (b) and (c) are used to compute the
initial cluster centers giving preference to samples with higher
probability mass. Finally, each dictionary atom is set as the
normalized cluster center $\boldsymbol{\mu}_k$.
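The initialization steps listed above can be sketched in NumPy as follows. This is a simplified illustration, not the authors' implementation: step (b) is treated as a one-time seed and only the sampling and augmentation steps are repeated, exactly `q` samples are drawn without replacement per iteration, and the final weighted clustering is done with a few weighted Lloyd iterations started from the heaviest candidates. All function and parameter names are hypothetical.

```python
import numpy as np

def sq_dists(X, centers):
    """Squared Euclidean distances between columns of X (M x T) and centers (M x n)."""
    d = (np.sum(X ** 2, axis=0)[:, None] - 2.0 * X.T @ centers
         + np.sum(centers ** 2, axis=0)[None, :])
    return np.maximum(d, 0.0)

def weighted_kmeans_parallel_init(X, p, K, q, s, rng, n_lloyd=10):
    """Sketch of a weighted K-Means||-style initialization; returns unit-norm atoms."""
    T = X.shape[1]
    # Steps (a), (b): seed the candidate set with one sample drawn according to p.
    chosen = [int(rng.choice(T, p=p))]
    for _ in range(s):
        # Step (c): draw up to q samples with probability proportional to p_i * delta_i.
        delta = np.sqrt(sq_dists(X, X[:, chosen]).min(axis=1))
        delta[chosen] = 0.0                  # never redraw an existing candidate
        prob = p * delta
        if prob.sum() == 0.0:
            break
        prob = prob / prob.sum()
        n_draw = min(q, int(np.count_nonzero(prob)))
        new = rng.choice(T, size=n_draw, replace=False, p=prob)
        # Step (d): augment the candidate set.
        chosen.extend(new.tolist())
    cand = X[:, chosen]
    if cand.shape[1] < K:
        raise ValueError("s * q must be large enough to yield at least K candidates")
    # Step (f): weight of a candidate = total mass of the samples nearest to it.
    nearest = sq_dists(X, cand).argmin(axis=1)
    w = np.bincount(nearest, weights=p, minlength=cand.shape[1])
    # Step (g): weighted Lloyd iterations on the candidates only.
    centers = cand[:, np.argsort(w)[::-1][:K]].copy()
    for _ in range(n_lloyd):
        assign = sq_dists(cand, centers).argmin(axis=1)
        for k in range(K):
            mask = assign == k
            if w[mask].sum() > 0:
                centers[:, k] = cand[:, mask] @ w[mask] / w[mask].sum()
    # Normalize the cluster centers to unit l2 norm to form the dictionary atoms.
    return centers / np.linalg.norm(centers, axis=0)

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 300))            # hypothetical training samples
p = np.full(300, 1.0 / 300)                  # uniform masses, as in the first round
atoms = weighted_kmeans_parallel_init(X, p, K=10, q=8, s=3, rng=rng)
```

Running the clustering only on the small candidate set keeps the cost low, which fits the goal of obtaining a weak dictionary rather than a carefully optimized one.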

2) BoostEx: From (3), it is clear that the learned dictionary
atoms are close to training samples that have higher probabilities.
Therefore, in the BoostEx method, the dictionary for
round $l$ is updated by choosing $K$ data samples based on the
non-uniform weight distribution, $\{p_l(\mathbf{x}_i)\}_{i=1}^T$, and normalizing
them. This scheme will ensure that those samples with high
approximation errors in the previous round will be better
represented in the current round.
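The BoostEx selection step amounts to weighted sampling of training examples. The short sketch below assumes sampling without replacement, which the text does not specify, and uses hypothetical data and names.

```python
import numpy as np

def boostex_dictionary(X, p, K, rng):
    """Pick K training samples (columns of X) according to the masses p and normalize them."""
    idx = rng.choice(X.shape[1], size=K, replace=False, p=p)
    D = X[:, idx]
    return D / np.linalg.norm(D, axis=0), idx

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 200))           # hypothetical training samples
# Hypothetical masses: all probability concentrated on the first 20 samples.
p = np.zeros(200)
p[:20] = 1.0 / 20
D, idx = boostex_dictionary(X, p, K=10, rng=rng)
```

With the masses concentrated on the first 20 samples, every selected index falls in that range, which shows how poorly represented samples come to dominate the next dictionary.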

C. Demonstration of the Proposed Approaches 

The performance of the proposed ensemble schemes for
dictionaries of three different sizes $K = \{256, 1024, 2048\}$
is compared. The training set described in Section II-B is
used with the RandExAv, BoostKM, and BoostEx schemes.
The dictionaries $\{\mathbf{D}_l\}_{l=1}^L$ and the weights $\{\beta_l\}_{l=1}^L$ are obtained
with the above schemes for $L = 20$. The individual
approximations in the training set are obtained using (1) with
the sparsity penalty set as $\lambda = 0.2$. For each sample in the test
set described in Section II-B, the individual representations are
computed using (1) with $\lambda = 0.2$. The ensemble approximation
for the $i$th test sample is obtained as $\sum_{l=1}^{L} \beta_l \mathbf{D}_l \mathbf{a}_{l,i}$. Figure
2 compares the performances of the proposed schemes for
different dictionary sizes. The minimum error obtained across
all individual approximations is also shown for comparison,
with all the three methods and the different dictionary sizes.
It can be seen that the proposed schemes satisfy the basic
property of the ensemble discussed in Section II, where it has
been shown that the ensemble approximation performs at least as well as
the best constituent individual approximation. As the
number of approximations in the ensemble increases,
the average mean squared error (MSE) for the three proposed
methods reduces, as shown in Figure 3 for a dictionary size
of 1024. Clearly, increasing the number of models in the
ensemble results in a better approximation, but the MSE
flattens out as the number of rounds increases.

Fig. 4. Illustration of the proposed boosted dictionary learning for image restoration. SC denotes sparse coding using (15).

IV. APPLICATION: IMAGE RESTORATION

In restoration applications, it is necessary to solve an inverse
problem in order to estimate the test data $\mathbf{y}$ from

$$\mathbf{z} = \Phi(\mathbf{y}) + \mathbf{n}, \tag{14}$$

where $\Phi(\cdot)$ is the corruption operator and $\mathbf{n}$ is the additive
noise. If the operator $\Phi(\cdot)$ is linear, we can represent it using
the matrix $\boldsymbol{\Phi}$. With the prior knowledge that $\mathbf{y}$ is sparsely
representable in a dictionary $\mathbf{D}$ according to (1), (14) can be
expressed as $\mathbf{z} = \boldsymbol{\Phi}\mathbf{D}\mathbf{a} + \mathbf{n}$. Restoring $\mathbf{y}$ now reduces to
computing the sparse codes $\mathbf{a}$ by solving

$$\min_{\mathbf{a}} \|\mathbf{z} - \boldsymbol{\Phi}\mathbf{D}\mathbf{a}\|_2^2 + \lambda\|\mathbf{a}\|_1, \tag{15}$$

and finally estimating $\mathbf{y} = \mathbf{D}\mathbf{a}$ [2]. In the proposed ensemble
methods, the final estimate of $\mathbf{y}$ is obtained as a weighted
average of the individual approximations. Furthermore, in the
boosting approaches, BoostKM and BoostEx, the degradation
operation can be included when learning the ensemble. This is
achieved by degrading the training data as $\boldsymbol{\Phi}\mathbf{X}$, and obtaining
the approximation with the coefficients computed using (15)
instead of (1). The procedure to obtain boosted dictionaries
using degraded data and computing the final approximation is
illustrated in Figure 4. In this figure, the final approximation
is estimated sequentially using the weights $\{\alpha_l\}_{l=1}^L$, but it is
equivalent to computing $\{\beta_l\}_{l=1}^L$ using (10) and computing the
ensemble estimate $\sum_{l=1}^{L} \beta_l \mathbf{D}_l \mathbf{A}_l$.

A. Compressive Recovery 

In compressed sensing (CS), the $N$-dimensional observation
$\mathbf{z}$ is obtained by projecting the $M$-dimensional data
$\mathbf{y}$ onto a random linear subspace, where $N \ll M$ [42]. In
this case, the entries of the degradation matrix $\boldsymbol{\Phi} \in \mathbb{R}^{N \times M}$
are obtained as i.i.d. realizations of a Gaussian or Bernoulli
random variable. Compressive recovery can be effectively
performed using conventional dictionaries or ensemble dictionaries.
In addition, the proposed idea of ensemble learning
can be incorporated in existing learning schemes to achieve
improved recovery performance. In particular, the multilevel
dictionary learning algorithm [3] can be very easily adapted
to compute ensemble representations. Before discussing the
experimental setup and the results of the proposed methods,
we will describe the modification to multilevel dictionary
learning for improving the compressed recovery performance
with learned dictionaries.

1) Improved Multilevel Dictionaries: The multilevel dictionary
(MLD) learning algorithm is a hierarchical procedure
where the dictionary atoms in each level are obtained using a
1-D subspace clustering procedure [3]. Multilevel dictionaries
have been shown to generalize well to novel test data, and 
have resulted in high performance in compressive recovery. We 
propose to employ the RandExAv procedure in each level of 
multilevel learning to reduce overfitting and thereby improve 
the accuracy of the dictionaries in representing novel test 
samples. In each level, L different dictionaries are drawn 
as random subsets of normalized training samples. For each 
training sample, a 1— sparse representation is computed with 
each individual dictionary, and the approximations are av- 
eraged to obtain the ensemble representation for that level. 
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TABLE I 

Compressed recovery of standard images: PSNR (dB) obtained using Alternating Dictionary Optimization (Alt-Opt), BoostEx (BEx),
BoostKM (BKM), RandExAv (RExAv), and Ex-MLD methods, for different values of N. The results reported were obtained by
averaging over 10 iterations with different random measurement matrices. In each case, the highest PSNR is given in bold font.

Number of Measurements (N)

Image       | N = 8                                  | N = 16                                 | N = 32
            | Alt-Opt BEx     BKM     RExAv   Ex-MLD | Alt-Opt BEx     BKM     RExAv   Ex-MLD | Alt-Opt BEx     BKM     RExAv   Ex-MLD
Barbara     | 21.55   22.05   22.04   22.08   22.95  | 23.52   23.86   23.73   23.68   24.39  | 26.45   26.53   26.44   26.28   26.66
Boat        | 23.08   23.73   23.95   23.99   25.08  | 25.91   26.29   26.57   26.59   26.96  | 28.79   29.32   29.61   29.61   29.9
Couple      | 23.15   23.81   24.02   24.05   25     | 25.87   26.30   26.57   26.56   27.19  | 28.83   29.33   29.68   29.67   29.8
Fingerprint | 18.10   18.76   19.15   19.16   20.39  | 21.74   22.18   22.84   22.86   23.19  | 25.36   25.85   26.59   26.63   26.82
House       | 24.52   25.12   25.51   25.52   26.55  | 28.01   28.14   28.63   28.66   28.93  | 31.28   31.53   32.01   32.03   32.25
Lena        | 25.14   25.84   26.18   26.25   27.17  | 28.31   28.73   29.08   29.13   29.59  | 31.12   31.77   32.15   32.17   32.65
Man         | 23.90   24.60   24.83   24.89   25.84  | 26.60   27.12   27.35   27.40   27.68  | 29.45   30.14   30.40   30.42   30.67
Peppers     | 21.31   21.83   22.17   22.23   23.12  | 24.30   24.54   24.82   24.91   25.68  | 27.28   27.69   28.03   28.11   28.57




[Figure 5 panels: Alt-Opt (24.81 dB), BEx (25.7 dB), BKM (26.06 dB), RExAv (26.15 dB), Ex-MLD (27.19 dB)]



Fig. 5. Compressed recovery of Man image using BoostKM dictionaries. The reconstructed images along with their corresponding PSNR are shown for the 
rounds {1, 5, 20, 50}, when 25% random measurements are used. 



Using the residual vectors as the training data, this process 
is repeated for multiple levels. The sparse approximation for 
a test sample is computed in a similar fashion. Since the 
sparse code computation in each level is performed using 
simple correlation operations, the computational complexity is
not increased significantly by employing ensemble learning.
In our simulations, we will refer to this approach as
Example-based Multilevel Dictionary learning (Ex-MLD).
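
The level-wise procedure described above can be sketched as follows for the training data. This is an illustrative reading of the text rather than the authors' code: the function names, the handling of zero residuals, and the choice to return only the approximation and final residual (instead of storing the per-level dictionaries needed for test samples) are assumptions.

```python
import numpy as np

def one_sparse_approx(X, D):
    """1-sparse approximation of each column of X using unit-norm atoms of D."""
    corr = D.T @ X                                 # correlations, shape (K, T)
    k = np.argmax(np.abs(corr), axis=0)            # best-matching atom per sample
    coef = corr[k, np.arange(X.shape[1])]          # coefficient of that atom
    return D[:, k] * coef                          # approximations, shape (M, T)

def ex_mld_approx(X, n_levels=16, n_atoms=16, n_dicts=50, seed=0):
    """Approximate X level by level, averaging over random example dictionaries."""
    rng = np.random.default_rng(seed)
    R = X.astype(float).copy()                     # residuals; level 1 uses the data
    approx = np.zeros_like(R)
    for _ in range(n_levels):
        norms = np.linalg.norm(R, axis=0)
        valid = np.flatnonzero(norms > 1e-12)      # samples with non-zero residual
        if valid.size == 0:
            break
        level = np.zeros_like(R)
        for _ in range(n_dicts):
            idx = rng.choice(valid, size=min(n_atoms, valid.size), replace=False)
            D = R[:, idx] / norms[idx]             # normalized random examples as atoms
            level += one_sparse_approx(R, D)
        level /= n_dicts                           # ensemble (averaged) representation
        approx += level
        R -= level                                 # residuals feed the next level
    return approx, R
```

Each level only needs correlations and an argmax per dictionary, which is why the added cost of the ensemble stays modest.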

2) Results: The training set is the same as that described
in Section II-B. For the baseline Alt-Opt approach, we train
a single dictionary with K = 256 using 100 iterations with
the sparsity penalty $\lambda_{tr}$ set to 0.1. The ensemble learning
procedures BoostEx, BoostKM and RandExAv are trained with
L = 50 and K = 256 for sparsity penalty $\lambda_{tr} = 0.1$. The
boosted ensembles are trained by taking the random projection
operator into consideration, as discussed in Section IV, for
the reduced measurements, N = {8, 16, 32}. For the Ex-MLD
method, both the number of levels and the number of atoms in
each level were fixed at 16. In each level, we obtained L = 50
dictionaries to compute the ensemble representation.

The recovery performance of the proposed ensemble models
is evaluated using the set of standard images shown in Table
I. Each image is divided into non-overlapping patches of size
8 × 8, and random projection is performed with the number
of measurements set at N = {8, 16, 32}. For the Alt-Opt
procedure, the individual patches are recovered using (15), and
for the ensemble methods, the approximations computed using
the L individual dictionaries are combined. The penalty $\lambda_{te}$ is
set to 0.1 for sparse coding in all cases. For each method, the
PSNR values were obtained by averaging the results over 10
iterations with different random measurement matrices, and
the results are reported in Table I. It was observed that the
proposed ensemble methods outperform the Alt-Opt method
in nearly all cases. In particular, we note that the simple RandExAv
generally performs better than the boosting approaches, although in
Section II-B it was shown that boosting approaches show a
superior performance. The reason for this discrepancy is that
boosting aggressively reduces error with training data, and
hence may lead to overfitting with degraded test data. In contrast,
the RandExAv method gives the same importance to all
individual approximations both during the training and the
testing phases. As a result, it provides a better generalization in
the presence of degradation. We also note that similar behavior
has been observed with ensemble classification methods [43].
Random sampling methods such as bagging perform better
than boosting with noisy examples, since bagging exploits
classification noise to produce more diverse classifiers.
Furthermore, we observed that the proposed Ex-MLD method
performed significantly better than all other approaches, particularly
for a lower number of measurements. Figure 5 shows the images
recovered using the different approaches, when N was fixed at
8. As can be observed, the Ex-MLD and RandExAv methods
provide PSNR gains of 2.38 dB and 1.34 dB respectively, when
compared to the Alt-Opt approach.
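
For reference, the patch handling and the PSNR metric used in this evaluation can be sketched as below. The helper names are hypothetical, and the peak value of 255 assumes 8-bit grayscale images, which the text does not state explicitly.

```python
import numpy as np

def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB; `peak` assumes 8-bit images."""
    diff = np.asarray(reference, dtype=float) - np.asarray(estimate, dtype=float)
    mse = np.mean(diff ** 2)
    return np.inf if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def to_patches(image, p=8):
    """Split an image into non-overlapping p x p patches, one vectorized patch per column."""
    h = (image.shape[0] // p) * p                  # crop to a multiple of p
    w = (image.shape[1] // p) * p
    blocks = image[:h, :w].reshape(h // p, p, w // p, p).swapaxes(1, 2)
    return blocks.reshape(-1, p * p).T

def from_patches(patches, shape, p=8):
    """Inverse of to_patches for an image whose sides are multiples of p."""
    h, w = shape
    blocks = patches.T.reshape(h // p, w // p, p, p).swapaxes(1, 2)
    return blocks.reshape(h, w)
```

Each column returned by `to_patches` plays the role of y; after recovery, `from_patches` reassembles the image on which the PSNR is computed.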

B. Single Image Superresolution 

Single image superresolution (SISR) attempts to reconstruct
a high-resolution image using just a single low-resolution
image. It is a severely ill-posed problem and in sparse
representation-based approaches, the prior knowledge that

[Figure 6 axis: # Training Samples / # Dictionary Atoms, with ticks 4051/256, 16425/512, 40032/1024]

Fig. 6. Effect of dictionary and training set sizes on the dictionary training
time for different learning schemes. The training times given are in seconds
and are compared only for PairDict (40 iterations), BoostEx (L = 50) and
BoostKM (L = 50) since ExDict and RandExAv require no training.



natural image patches can be represented as a sparse linear
combination of elementary patches, is used. The degraded test
image is represented as $Z = \Phi Y$, where the operator $\Phi$
blurs the high-resolution image Y and then downsamples it.
Note that Y and Z denote vectorized high- and low-resolution
images respectively. Each overlapping patch obtained from
the degraded image is denoted as z. The paired dictionary
learning procedure (PairDict) proposed in [36] has been
very effective in recovering the high-resolution patches. This
method initially creates degraded counterparts of the
high-resolution training images, following which gradient-based
features are extracted from the low-resolution patches and the
features are appended to the corresponding vectorized
high-resolution patches. These augmented features are used to train
a paired dictionary $(D_{lo}, D_{hi})$ such that each low-resolution
patch and its corresponding high-resolution patch share the same
sparse code. For a low-resolution test patch z, the sparse code a
is obtained using $D_{lo}$, and the corresponding high-resolution
counterpart is recovered as $y = D_{hi} a$. An initial estimate
$Y_0$ of the high-resolution image is obtained by appropriately
averaging the overlapped high-resolution patches. Finally, a
global reconstruction constraint is enforced by projecting $Y_0$
onto the solution space of $\Phi Y = Z$,

$$\min_{Y} \|Z - \Phi Y\|_2^2 + c \|Y - Y_0\|_2^2, \qquad (16)$$



to obtain the final reconstruction. As an alternative, in the
example-based procedure (ExDict) proposed in [20], the
dictionaries $D_{lo}$ and $D_{hi}$ are directly fixed as the features
extracted from low-resolution patches and vectorized
high-resolution patches respectively. Similar to the PairDict method,
the global reconstruction constraint in (16) is imposed for the
final reconstruction.
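
The global reconstruction step in (16) is a regularized least-squares problem, so for a small, explicitly stored operator it has a closed-form solution. The sketch below illustrates this on a toy 1-D signal; the function name, the toy operator, and the value of c are assumptions for the example, and for real images, where $\Phi$ is far too large to form explicitly, an iterative scheme would be used instead.

```python
import numpy as np

def enforce_global_constraint(Z, Phi, Y0, c=0.1):
    """Minimize ||Z - Phi Y||_2^2 + c ||Y - Y0||_2^2 in closed form.

    Setting the gradient to zero gives (Phi^T Phi + c I) Y = Phi^T Z + c Y0.
    """
    n = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + c * np.eye(n), Phi.T @ Z + c * Y0)

# Toy degradation: average neighbouring pairs of a length-8 signal
# (a crude stand-in for blurring followed by downsampling by 2).
Phi = np.kron(np.eye(4), np.array([[0.5, 0.5]]))
Y_true = np.array([1.0, 3.0, 2.0, 2.0, 5.0, 7.0, 0.0, 4.0])
Z = Phi @ Y_true
Y0 = Y_true + 0.3                      # hypothetical biased initial estimate
Y = enforce_global_constraint(Z, Phi, Y0, c=0.01)
```

A small c pulls the solution towards consistency with the observation Z, while a large c keeps it close to the initial estimate $Y_0$.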

In our simulations, standard grayscale images (Table II) are
magnified by a factor of 2, using the proposed approaches. In
addition to the PairDict and ExDict methods, simple bicubic
interpolation is also used as a baseline method. We also
obtained paired dictionaries with 1024 atoms using 100,000
randomly chosen patches of size 5 × 5 from the grayscale
natural images in the training set. The sparsity penalty used
in training was $\lambda_{tr} = 0.15$. The training set was reduced to
the size of 20,000 samples and used as the dictionary for the



TABLE II

Superresolution of standard images upscaled by a factor of 2:
PSNR in dB obtained with bicubic interpolation (Bicubic), paired
dictionary (PairDict) [36], example dictionary (ExDict) [20],
BoostEx (BEx), BoostKM (BKM), and RandExAv (RExAv)
methods.

Image       Bicubic   PairDict  ExDict    BEx       BKM       RExAv
Lena        34.10     35.99     35.99     35.97     35.95     35.99
Boat        29.94     31.34     31.34     31.28     31.23     31.29
House       32.77     34.49     34.49     34.38     34.41     34.41
Cameraman   26.33     27.78     27.78     27.72     27.78     27.71
Straw       24.20     25.93     25.93     25.90     25.90     25.94
Girl        33.81     35.39     35.39     35.33     35.37     35.35



ExDict method. For ensemble learning, L was fixed at 50 and
the approximation for each data sample was obtained using
just a 1-sparse representation.

For different numbers of training samples, we compared the
training times for the PairDict (40 iterations), BoostEx (L = 50)
and BoostKM (L = 50) algorithms in Figure 6. The computation
times reported in this paper were obtained using a single
core of a 2.8 GHz Intel i7 Linux machine with 8GB RAM. 
The BoostKM approach has the maximum computational 
complexity for training, followed by PairDict and BoostEx 
approaches. The ExDict procedure requires no training and for 
RandExAv, training time is just the time for randomly selecting 
K samples from the training set of T samples, for L rounds. 
Clearly, the complexity incurred for this is extremely low. 

For the test images, SISR is performed using the baseline
PairDict and ExDict approaches using a sparsity penalty of
$\lambda_{te} = 0.2$. For the PairDict and ExDict approaches, the code
provided by the authors p8) was used to generate the results. 
The recovery performance of the proposed algorithms is
reported in Table II. For PairDict, as well as the proposed ensemble
methods, the dictionary size is fixed at 1024, whereas all
the examples are used for training with the ExDict approach.
We observed from our results that an ensemble representation
with a simplified sparse coding scheme (1-sparse) matched the
performance of the baseline methods (Figure 7).

V. Application: Unsupervised Clustering 

Conventional clustering algorithms such as K-Means provide
good clusterings only when the natural clusters of the
data are distributed around a mean vector in space. For data
that lie in a union of low-dimensional subspaces, it is beneficial
to develop clustering algorithms that try to model the actual
data distribution better. The sparse subspace clustering method
[44], a special case of which is referred to as $\ell_1$ graph
clustering [29], results in clusters that correspond to subspaces
of data. This is achieved by representing each example as a
sparse linear combination of the others and finally performing
spectral clustering using a similarity matrix obtained from the
coefficient matrix. The clustering method has the advantage
of incorporating the noise model directly when performing
sparse coding, thereby achieving robustness. The coefficient
vector for the $i$th data sample is obtained as

$$\min_{a_i} \|x_i - X a_i\|_2^2 + \lambda \|a_i\|_1, \quad \text{subj. to } a_{ii} = 0. \qquad (17)$$








[Figure 7 panels: Degraded, Original, PairDict (27.78 dB), ExDict (27.78 dB), BoostKM (27.78 dB)]

Fig. 7. SISR of the Man image with scaling factor of 2. The PairDict, ExDict, and RandExAv methods result in very similar high resolution images.



By imposing the constraint that the $i$th element of $a_i$ should
be 0, we ensure that a data sample is not represented by
itself, which would have resulted in a trivial approximation.
The coefficient matrix is denoted as $A = [a_1\ a_2 \ldots a_T]$, and
spectral clustering [45] is performed by setting the similarity
matrix to the symmetric non-negative version of the coefficient
matrix, $S = |A| + |A^T|$. Computing the graph in this case
necessitates the computation of sparse codes of T data samples
with an $M \times (T - 1)$ dictionary. Sparse coding-based graphs
can also be constructed based on coefficients obtained with a
dictionary D, inferred using the Alt-Opt procedure. Denoting
the sparse codes for the examples X by the coefficient matrix
$A = [a_1 \ldots a_T]$, the similarity matrix can be constructed
as $S = |A^T A|$. Similar to the $\ell_1$ graphs, this similarity
matrix can be used with spectral clustering to estimate the
cluster memberships. In this case, the dominant complexity
in computing the graph is in learning the dictionary, and
obtaining the sparse codes for each example. When the number
of training examples is large, or when the data is
high-dimensional, approaches that use sparse coding-based graphs
incur high computational complexity.
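
The two graph constructions above can be sketched as follows. This is a minimal illustration under stated assumptions: the sparse codes in (17) are computed with a plain ISTA loop, the constraint $a_{ii} = 0$ is enforced by leaving sample i out of its own dictionary, and the function names are hypothetical. The resulting matrix S would then be passed to any spectral clustering routine that accepts a precomputed similarity matrix.

```python
import numpy as np

def lasso_ista(x, B, lam=0.1, n_iter=300):
    """Approximately solve min_a ||x - B a||_2^2 + lam ||a||_1 with ISTA."""
    step = 1.0 / (2.0 * np.linalg.norm(B, 2) ** 2 + 1e-12)
    a = np.zeros(B.shape[1])
    for _ in range(n_iter):
        v = a - step * (2.0 * B.T @ (B @ a - x))   # gradient step on the quadratic
        a = np.sign(v) * np.maximum(np.abs(v) - step * lam, 0.0)
    return a

def l1_graph(X, lam=0.1):
    """Similarity S = |A| + |A^T|, where column i of A codes x_i with the other samples."""
    T = X.shape[1]
    A = np.zeros((T, T))
    for i in range(T):
        others = np.delete(np.arange(T), i)        # excluding x_i enforces a_ii = 0
        A[others, i] = lasso_ista(X[:, i], X[:, others], lam)
    return np.abs(A) + np.abs(A.T)

def dictionary_graph(A):
    """Similarity S = |A^T A| from a K x T matrix of sparse codes."""
    return np.abs(A.T @ A)
```

The loop in `l1_graph` solves T sparse coding problems, each with an $M \times (T - 1)$ dictionary, which is the cost the text identifies as prohibitive for large T.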

We propose to construct sparse representation-based graphs
using our ensemble approaches and employ them in spectral
clustering. In our ensemble approaches, we have two
example-based procedures (RandExAv and BoostEx) and one
that uses K-Means dictionaries (BoostKM). For BoostKM,
we obtain L dictionaries of size K using the boosting
procedure, with 1-sparse approximations. The final coefficient
vector of length LK for the data sample $x_i$ is obtained
as $a_i = [a_{i,1}^T \; a_{i,2}^T \ldots a_{i,L}^T]^T$, where $a_{i,l}$ is the coefficient
vector for round l. The similarity matrix is then estimated
as $S = |A^T A|$. In the example-based procedures, again a
1-sparse representation is used to obtain the coefficient
vectors $\{a_{i,1}, a_{i,2}, \ldots, a_{i,L}\}$ for a data sample $x_i$. A cumulative
coefficient vector of length T can be obtained by recognizing
that each coefficient in $a_{i,l} \in \mathbb{R}^K$ can be associated to a
particular example, since $D_l$ is an example-based dictionary.
Therefore a new 1-sparse coefficient vector $\bar{a}_{i,l} \in \mathbb{R}^T$ is
created such that $D_l a_{i,l} = \bar{X} \bar{a}_{i,l}$, where $\bar{X}$ contains the
normalized set of data samples X. Finally the cumulative
coefficient vector for $x_i$ is obtained as $a_i = \sum_{l=1}^{L} \beta_l \bar{a}_{i,l}$. They are
then stacked to form the coefficient matrix $A = [a_1 \ldots a_T]$.
Spectral clustering can be now performed using the similarity
matrix $S = |A| + |A^T|$. The clustering performance
was evaluated in terms of accuracy and normalized mutual
information (NMI), and compared with $\ell_1$ graphs. As seen



TABLE III

Comparison of the clustering performances (accuracy and
normalized mutual information) of the algorithms with
standard datasets. The best maximum or average performance
is given in bold font.

            $\ell_1$ graph  BoostEx         RandExAv        BoostKM
Dataset     max     avg     max     avg     max     avg     max     avg

Accuracy
Digits      88.72   74.61   88.63   76.79   88.62   77.56   88.31   76.85
Soybean     67.44   58.22   67.26   63.21   70.82   65.31   69.22   63.50
Segment     65.32   57.84   63.42   58.46   63.33   56.49   65.63   58.67
Satimage    77.56   69.37   75.03   72.88   83.59   75.25   71.62   65.32
USPS        78.24   62.47   75.01   68.18   78.26   64.14   90.96   75.18

NMI
Digits      84.97   76.50   84.69   77.67   84.73   78.04   84.31   78.02
Soybean     74.55   65.94   72.84   68.66   77.50   73.71   76.05   71.66
Segment     59.14   53.91   56.72   53.23   54.94   51.20   58.44   55.32
Satimage    65.09   61.20   62.98   59.01   69.38   66.12   60.27   55.57
USPS        81.04   74.14   70.52   66.89   82.67   76.95   82.78   78.37



from Table III, the ensemble-based approaches result in high
accuracy as well as NMI. In all our simulations, data was
preprocessed by centering and normalizing to unit norm.
It was observed that the proposed ensemble methods incur
comparable computational complexity to $\ell_1$ graphs for datasets
with small data dimensions. However, we observed significant
complexity reduction with the USPS dataset, which contains
9298 samples of 256 dimensions. To cluster the USPS samples,
the $\ell_1$ graph approach took 411.85 seconds to compute the
sparse codes, whereas BoostEx, RandExAv, and BoostKM took
152.56, 147.93, and 83.58 seconds respectively. This indicates
the suitability of the proposed methods for high-dimensional,
large scale data.
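
The graph construction for the example-based ensembles can be sketched as below. It follows the RandExAv-style case, with randomly drawn example dictionaries and uniform weights unless others are supplied; BoostEx would instead supply its own dictionaries and weights. The function name, the uniform default weights, and the rule that a sample may not select itself as an atom are assumptions made for the example.

```python
import numpy as np

def example_ensemble_graph(X, n_atoms, n_rounds=50, weights=None, seed=0):
    """Build S = |A| + |A^T| from 1-sparse codes over random example dictionaries."""
    rng = np.random.default_rng(seed)
    Xn = X / np.linalg.norm(X, axis=0)             # normalized samples (assumed non-zero)
    T = X.shape[1]
    if weights is None:
        beta = np.full(n_rounds, 1.0 / n_rounds)   # uniform weights (simple averaging)
    else:
        beta = np.asarray(weights, dtype=float)
    A = np.zeros((T, T))
    cols = np.arange(T)
    for l in range(n_rounds):
        idx = rng.choice(T, size=n_atoms, replace=False)  # examples used as atoms
        corr = Xn[:, idx].T @ X                    # (K, T) correlations with the atoms
        corr[np.arange(n_atoms), idx] = 0.0        # assumption: no self-representation
        k = np.argmax(np.abs(corr), axis=0)        # chosen atom for each sample
        # Map each chosen atom back to its example index and accumulate.
        A[idx[k], cols] += beta[l] * corr[k, cols]
    return np.abs(A) + np.abs(A.T)
```

Because each round only needs a K × T correlation matrix and an argmax, the cost grows with LK rather than with the T − 1 atoms used by the $\ell_1$ graph, which is consistent with the timing differences reported for USPS.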

VI. Conclusions 

We proposed and analyzed the framework of ensemble 
sparse models, where the data is represented using a linear 
combination of approximations from multiple sparse represen- 
tations. Theoretical results and experimental demonstrations 
show that an ensemble representation leads to a better approx- 
imation when compared to its individual constituents. Three 
different methods for learning the ensemble were proposed. 
Results in compressive recovery showed that the proposed 
approaches performed better than the baseline sparse coding 
method. Furthermore, the ensemble approach performed com- 
parably to several recent techniques in single image superres- 
olution. Results with unsupervised clustering also showed that 






the proposed method leads to better clustering performance in
comparison to the $\ell_1$ graph method.